
    Analyse morphologique non supervisée en domaine biomédical. Application à la recherche d'information

    International audience. In the biomedical domain, using specialized terms is essential for accessing information. However, in many languages these terms are complex morphological constructions that hamper this access to information. In this article, we focus on identifying the morphological components of such terms and on using them in an information retrieval (IR) task. We propose several approaches relying on automatic alignment with a particular pivot language, Japanese, and on analogy-based learning, to produce fine-grained morphological analyses of the terms of a given language. These morphological analyses are then used to improve the indexing of biomedical documents. The reported experiments show the validity of this approach, with MAP gains of more than 10% over a standard IR system.
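Analogical learning over term strings can be sketched in its simplest form: assuming the analogy a : b :: c : d reduces to a suffix swap shared between the two pairs. The medical terms below are illustrative examples, not data from the paper.

```python
def analogy_solve(a, b, c):
    """Solve a : b :: c : ? when a and b differ only by their suffixes."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1  # longest common prefix of a and b
    suffix_a, suffix_b = a[i:], b[i:]
    if not c.endswith(suffix_a):
        return None  # the analogy pattern does not apply to c
    return c[:len(c) - len(suffix_a)] + suffix_b

# gastrite : gastrique :: hepatite : ?
result = analogy_solve("gastrite", "gastrique", "hepatite")  # -> "hepatique"
```

Decompositions recovered this way (stem + derivational suffix) are the kind of morphological analyses that can then feed document indexing.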

    Topic segmentation of TV-streams by watershed transform and vectorization

    International audience. A fine-grained segmentation of radio or TV broadcasts is an essential step for most multimedia processing tasks. Applying segmentation algorithms to the speech transcripts seems straightforward; yet, most of these algorithms are not suited to short segments or noisy data. In this paper, we present a new segmentation technique inspired by the image analysis field and relying on a new way to compute similarities between candidate segments, called Vectorization. Vectorization makes it possible to match text segments that do not share common words; this property is shown to be particularly useful when dealing with transcripts in which transcription errors and short segments make segmentation difficult. This new topic segmentation technique is evaluated on two corpora of transcripts from French TV broadcasts, on which it largely outperforms other existing state-of-the-art approaches.
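A toy illustration of the Vectorization idea, under the assumption that it can be approximated by projecting each segment onto its similarities with a set of anchor texts: two segments sharing no word can still be matched if they resemble the same anchors. The anchors and segments below are invented.

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    num = sum(u[w] * v.get(w, 0) for w in u)
    den = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def vectorize(text, anchors):
    # represent a segment by its similarity to each anchor text
    bow = Counter(text.split())
    return {i: cosine(bow, Counter(a.split())) for i, a in enumerate(anchors)}

anchors = ["football match goal team stadium", "election vote president party"]
s1, s2 = "goal scored by striker", "football match crowd stadium"
direct = cosine(Counter(s1.split()), Counter(s2.split()))         # 0.0: no common word
matched = cosine(vectorize(s1, anchors), vectorize(s2, anchors))  # > 0 via the anchors
```

The two segments are lexically disjoint, yet their vectorized representations are close because both resemble the first anchor.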

    Inferring syntactic rules for word alignment through Inductive Logic Programming

    International audience. This paper presents and evaluates an original approach to automatically aligning bitexts at the word level. It relies on a syntactic dependency analysis of the source and target texts and is based on a machine-learning technique, namely inductive logic programming (ILP). We show that ILP is particularly well suited to this task, in which the data can only be expressed by (translational and syntactic) relations. It allows us to easily infer rules, called syntactic alignment rules, that make the most of the syntactic information to align words. A simple bootstrapping technique provides the examples needed by ILP, making this machine-learning approach entirely automatic. Moreover, through different experiments, we show that this approach requires a very small amount of training data and that its performance rivals some of the best existing alignment systems. Furthermore, cases of syntactic isomorphism or non-isomorphism between the source language and the target language are easily identified through the inferred rules.
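A hypothetical example of the kind of syntactic alignment rule such a system could infer: propagate an anchor alignment along dependency edges that bear the same relation in source and target. The data structures and the rule itself are simplified stand-ins, not the rules learned in the paper.

```python
def propagate(anchor_links, src_deps, tgt_deps):
    """src_deps / tgt_deps map (head, dependent) pairs to a relation label.
    Rule: if (s, t) is aligned and s -> s' and t -> t' bear the same
    relation, then align (s', t'). Iterate until no new link is added."""
    links = set(anchor_links)
    changed = True
    while changed:
        changed = False
        for (sh, sd), srel in src_deps.items():
            for (th, td), trel in tgt_deps.items():
                if srel == trel and (sh, th) in links and (sd, td) not in links:
                    links.add((sd, td))
                    changed = True
    return links

src = {("eats", "apple"): "obj", ("eats", "cat"): "subj"}
tgt = {("mange", "pomme"): "obj", ("mange", "chat"): "subj"}
links = propagate({("eats", "mange")}, src, tgt)
```

Starting from the single anchor (eats, mange), the rule aligns (apple, pomme) and (cat, chat) through the matching obj and subj edges.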

    La phonétisation comme un problÚme de translittération

    International audience. Phonetizing is a crucial step in processing oral documents. In this paper, a new word-based phonetization approach is proposed; it is automatic, simple, portable and efficient. It relies on machine learning; the system is thus built from examples of words with their phonetic representations. More precisely, it makes the most of a technique inferring rewriting rules initially developed for transliteration and translation. To evaluate the performance of this approach, we used several datasets from the Pronalsyl Pascal challenge, covering different languages and various phonetic alphabets. The obtained results equal or outperform those of the best known systems.
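Applying rewriting rules for phonetization can be sketched as greedy, longest-match-first grapheme-to-phoneme substitution. The tiny rule set below is a made-up sample for French-like spellings, not the rules the system actually infers.

```python
def phonetize(word, rules):
    """Apply grapheme->phoneme rewriting rules left to right,
    preferring the longest matching grapheme at each position."""
    out, i = [], 0
    by_length = sorted(rules.items(), key=lambda kv: -len(kv[0]))
    while i < len(word):
        for grapheme, phoneme in by_length:
            if word.startswith(grapheme, i):
                out.append(phoneme)
                i += len(grapheme)
                break
        else:
            out.append(word[i])  # no rule: copy the character as-is
            i += 1
    return "".join(out)

rules = {"eau": "o", "ch": "S", "ou": "u", "on": "O~"}
print(phonetize("chateau", rules))  # -> "Sato" under these toy rules
```

A learned system would replace this hand-written rule table with rules induced from example (word, pronunciation) pairs, possibly with contextual conditions.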

    Dimensionnalité intrinsÚque dans les espaces de représentation des termes et des documents

    National audience. Examining the properties of representation spaces for documents or words in IR (typically R^n with n large) brings precious insights to help the retrieval process. Recently, several authors have studied the real dimensionality of the datasets, called intrinsic dimensionality, in specific parts of these spaces (Houle et al., 2012a). In this paper, we propose to revisit this notion through a coefficient called α in the specific case of IR and to study its use in IR tasks. More precisely, we show how to estimate α from IR similarities and how to use it in the representation spaces used for documents and words (Mikolov et al., 2013; Claveau et al., 2014). We show that α can be used to characterize difficult queries; moreover, in a query expansion task, we show that this intrinsic dimensionality notion, applied to words, helps to choose the terms to expand and their expansions.
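One plausible instantiation of such a local intrinsic dimensionality coefficient is the maximum-likelihood estimator used in that literature, which needs only the distances from a query to its k nearest neighbours; whether this is exactly the α of the paper is an assumption here.

```python
from math import log

def local_intrinsic_dim(distances):
    """MLE estimate of local intrinsic dimensionality from the (nonzero)
    distances of one query point to its k nearest neighbours."""
    r = sorted(distances)
    r_max = r[-1]
    s = sum(log(d / r_max) for d in r[:-1])
    return -(len(r) - 1) / s

d = local_intrinsic_dim([0.2, 0.4, 0.5, 0.8, 1.0])  # low value: locally low-dimensional
```

Intuitively, the faster neighbour distances grow with rank, the lower the estimate; queries whose neighbourhoods look high-dimensional tend to be harder to separate.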

    PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided MCTS Decoding

    Large language models (LMs) based on Transformers can generate plausible long texts. In this paper, we explore how this generation can be further controlled at decoding time to satisfy certain constraints (e.g. being non-toxic, conveying certain emotions, using a specific writing style, etc.) without fine-tuning the LM. Precisely, we formalize constrained generation as a tree exploration process guided by a discriminator that indicates how well the associated sequence respects the constraint. This approach, in addition to being easier and cheaper to train than fine-tuning the LM, allows the constraint to be applied more finely and dynamically. We propose several original methods to search this generation tree, notably Monte Carlo Tree Search (MCTS), which provides theoretical guarantees on the search efficiency, but also simpler methods based on re-ranking a pool of diverse sequences using the discriminator scores. These methods are evaluated, with automatic and human-based metrics, on two types of constraints and two languages: review polarity and emotion control in French and English. We show that discriminator-guided MCTS decoding achieves state-of-the-art results in both tasks and languages without having to tune the language model. We also demonstrate that the other proposed decoding methods, based on re-ranking, can be very effective when diversity among the generated propositions is encouraged.
    Comment: 15 pages, 5 tables, 7 figures, accepted to NAACL 202
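The simplest of the decoding strategies described above, re-ranking, can be sketched as follows: sample a pool of diverse sequences from the LM, then keep the one the discriminator scores highest. `sample_lm` and `discriminator` below are toy stand-ins, not real models.

```python
def rerank(prompt, sample_lm, discriminator, pool_size=6):
    """Sample pool_size candidate continuations, return the one that
    best satisfies the constraint according to the discriminator."""
    candidates = [sample_lm(prompt) for _ in range(pool_size)]
    return max(candidates, key=discriminator)

# toy stand-ins: a fixed pool of "sampled" texts and a keyword discriminator
pool = iter(["it was awful", "what a happy day", "meh", "so sad",
             "quite nice", "terrible film"])
best = rerank("Today:", lambda p: next(pool), lambda s: s.count("happy"))
print(best)  # -> "what a happy day"
```

MCTS replaces this one-shot pool with an incremental tree search in which discriminator scores of rolled-out sequences are backed up to guide which token branches to expand next.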

    Detecting fake news in tweets from text and propagation graph: IRISA's participation to the FakeNews task at MediaEval 2020

    International audience. This paper presents the participation of IRISA in the task of fake news detection from tweets, relying either on the text or on propagation information. For text-based detection, variants of BERT-based classification are proposed. To improve this standard approach, we investigate the interest of augmenting the dataset by creating tweets with fine-tuned generative models. For graph-based detection, we propose models characterizing the propagation of the news or the users' reputation.

    Speculation and negation detection in french biomedical corpora

    International audience. In this work, we address the detection of negation and speculation, and of their scope, in French biomedical documents. It has indeed been observed that they play an important role and provide crucial clues for other NLP applications. Our methods are based on CRFs and BiLSTMs. We reach up to 97.21% and 91.30% F-measure for the detection of negation and speculation cues, respectively, using CRFs. For scope computation, we reach up to 90.81% and 86.73% F-measure on negation and speculation, respectively, using a BiLSTM-CRF fed with word embeddings.
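As a point of reference for what the CRF/BiLSTM models above improve on, a naive baseline detects cues from a small lexicon and takes the scope as the tokens following the cue up to the next punctuation mark. Both the lexicon and the example are invented.

```python
NEG_CUES = {"no", "not", "without", "never"}

def negation_scopes(tokens):
    """Return (cue_index, scope_tokens) pairs using a lexicon of cues and
    a punctuation-bounded scope heuristic."""
    scopes = []
    for i, tok in enumerate(tokens):
        if tok.lower() in NEG_CUES:
            j = i + 1
            while j < len(tokens) and tokens[j] not in {".", ",", ";"}:
                j += 1
            scopes.append((i, tokens[i + 1:j]))
    return scopes

toks = "patient shows no sign of infection , afebrile .".split()
scopes = negation_scopes(toks)  # -> [(2, ['sign', 'of', 'infection'])]
```

Sequence models replace both hard-coded parts: the cue lexicon becomes a learned tagger, and the punctuation heuristic becomes learned scope boundaries.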

    Measuring vagueness and subjectivity in texts: from symbolic to neural VAGO

    We present a hybrid approach to the automated measurement of vagueness and subjectivity in texts. We first introduce the expert system VAGO, illustrate it on a small benchmark of fact vs. opinion sentences, and then test it on the larger French press corpus FreSaDa to confirm the higher prevalence of subjective markers in satirical vs. regular texts. We then build a neural clone of VAGO, based on a BERT-like architecture, trained on the symbolic VAGO scores obtained on FreSaDa. Using explainability tools (LIME), we show the interest of this neural version for enriching the lexicons of the symbolic version and for producing versions in other languages.
    Comment: Paper to appear in the Proceedings of the 2023 IEEE International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)
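A minimal sketch of a symbolic, lexicon-based score in the spirit of VAGO: the fraction of tokens belonging to a vagueness/subjectivity lexicon. The tiny lexicon and the sentences are illustrative, not the actual VAGO resources.

```python
VAGUE_LEXICON = {"maybe", "somewhat", "huge", "terrible", "roughly", "good"}

def vago_score(sentence):
    """Fraction of tokens that are vagueness/subjectivity markers."""
    tokens = sentence.lower().split()
    return sum(t in VAGUE_LEXICON for t in tokens) / len(tokens)

fact = "the meeting starts at 9 am"
opinion = "it was a somewhat terrible and huge mess"
# opinion sentences score higher than factual ones
```

The neural clone described above learns to reproduce such scores from raw text, which is what lets LIME-style attributions surface candidate markers missing from the lexicon.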

    Supervised Machine Learning Techniques to Detect TimeML Events in French and English

    International audience. Identifying events in texts is an information extraction task necessary for many NLP applications. Through the TimeML specifications and TempEval challenges, it has received some attention in recent years; yet, no reference result is available for French. In this paper, we try to fill this gap by proposing several event extraction systems, combining, for instance, Conditional Random Fields, language modeling and k-nearest neighbors. These systems are evaluated on French corpora and compared with state-of-the-art methods on English. The very good results obtained on both languages validate our whole approach.
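One of the components mentioned above, a k-nearest-neighbors tagger, can be illustrated on tokens with a crude character-overlap distance; the features, distance, and training data here are invented for the example, not those of the paper.

```python
from collections import Counter

def knn_tag(token, train, k=3):
    """Tag a token by majority vote among the k closest training tokens;
    `train` is a list of (token, label) pairs."""
    def dist(a, b):
        # smaller is closer: shared characters minus length mismatch
        return -len(set(a) & set(b)) + abs(len(a) - len(b))
    nearest = sorted(train, key=lambda tl: dist(token, tl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [("explosion", "EVENT"), ("meeting", "EVENT"), ("election", "EVENT"),
         ("table", "O"), ("paris", "O"), ("blue", "O")]
print(knn_tag("elections", train))  # -> "EVENT"
```

In a realistic system the distance would operate on richer token features (lemma, part of speech, context words) rather than raw character sets.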